Abstract. Although real reproducing kernels are used in an increasing number
of machine learning problems, complex kernels have not yet been used, in spite
of their potential interest in applications such as communications. In this
work, we focus on the complex Gaussian kernel and its possible application in
the complex Kernel LMS algorithm. In order to derive the gradients needed to
develop the complex kernel LMS (CKLMS), we employ the powerful tool of
Wirtinger's calculus, which has recently attracted much attention in the
signal processing community. Wirtinger's calculus simplifies computations and
offers an elegant tool for treating complex signals. To this end, the notion
of Wirtinger's calculus is extended to include complex RKHSs. Experiments
verify that the CKLMS offers significant performance improvements over the
traditional complex LMS or Widely Linear complex LMS (WL-LMS) algorithms when
dealing with nonlinearities.

Key words: Kernel Methods, LMS, Reproducing Kernel Hilbert Spaces, Complex
Kernels, Wirtinger Calculus, Kernels

1 Introduction

In recent years, kernel based algorithms have become the state of the art for
many problems, especially in the machine learning community. The common
feature of these problems is that they are cast as optimization problems over
a Reproducing Kernel Hilbert Space (RKHS). The main advantage of mobilizing
the tool of RKHSs is that the original nonlinear task is "transformed" into a
linear one, where one can employ an easier "algebra". Moreover, different
types of nonlinearities can be treated in a unifying way that does not affect
the derivation of the algorithms, except at the final implementation stage.
The main concepts of this procedure can be summarized in the following two
steps: 1) map the finite dimensional input data from the input space $F$
(usually $F \subset \mathbb{R}^{\nu}$) into a higher dimensional (possibly
infinite dimensional) RKHS $\mathcal{H}$, and 2) perform linear processing
(e.g., adaptive filtering) on the mapped data in $\mathcal{H}$. The procedure
is equivalent to nonlinear processing (nonlinear filtering) in $F$.

An alternative way of describing this process is through the popular kernel
trick [1], [2]: "Given an algorithm which is formulated in terms of dot
products, one can construct an alternative algorithm by replacing each one of
the dot products with a positive definite kernel $\kappa$". The specific
choice of kernel implicitly defines an RKHS with an appropriate inner product.
Furthermore, the choice of a kernel also defines the type of nonlinearity that
underlies the model to be used. Although there are several kernels available
in the relevant literature, in most cases the powerful real Gaussian kernel is
adopted.

The main representatives of this class of algorithms are the celebrated
support vector machines (SVMs), which have dominated research in machine
learning over the last decade. Moreover, processing in Reproducing Kernel
Hilbert Spaces (RKHSs) in the context of online adaptive processing is also
gaining in popularity within the signal processing community [3], [4], [5],
[6], [7]. Besides SVMs and the more recent applications in adaptive filtering,
there is a plethora of other scientific domains that have gained from adopting
kernel methods (e.g., image processing and denoising [8], [9], principal
component analysis [10], clustering [11], etc.).

Although the real Gaussian RBF kernel is quite popular in the aforementioned
context, the existence of the corresponding complex Gaussian kernel is
relatively unknown to the machine learning community. This is partly due to
the fact that in classification tasks (the dominant application of kernel
methods) the use of complex kernels is prohibitive, since no ordering can be
defined in complex domains and the necessary separating hyperplane of SVMs
cannot be defined. Consequently, all known kernel based applications, since
they emerged from this specific background, use real-valued kernels and are
able to deal with real valued data sequences only. While the complex Gaussian
RBF kernel is known to mathematicians (especially those working on Reproducing
Kernel Hilbert Spaces or Functional Analysis), it has remained in obscurity in
the machine learning community. In this paper, however, we use the complex
Gaussian kernel to address the problem of adaptive filtering of complex
signals in RKHSs, focusing on the recently developed Kernel LMS (KLMS) [3],
[12]. The main goals of this paper are: a) to elevate from obscurity the
complex Gaussian kernel as an effective tool for kernel based adaptive
processing of complex signals, b) to extend Wirtinger's calculus to complex
RKHSs as a means for the elegant and efficient computation of the gradients
that are involved in many adaptive filtering algorithms, and c) to develop the
Complex Kernel LMS (CKLMS) algorithm, by exploiting the extension of
Wirtinger's calculus and the RKHS of complex Gaussian kernels. Wirtinger's
calculus [13] has recently enjoyed increasing popularity, mainly in the
context of Widely Linear complex adaptive filters [14], [15], [16], [17],
[18], providing a tool for the derivation of gradients in the complex domain.

The paper is organized as follows. In Section 2 we provide a minimal
introduction to complex RKHSs, focusing on the complex Gaussian kernel and its
relation to the real one. Next, in Section 3 we summarize the main notions of
the extended Wirtinger's calculus. Section 4 presents the Gaussian complex
kernel LMS algorithm. Finally, experimental results and conclusions are
provided in Section 5. We will denote the sets of all real and complex numbers
by $\mathbb{R}$ and $\mathbb{C}$ respectively. Vector or matrix valued
quantities appear in boldfaced symbols.
2 Reproducing Kernel Hilbert Spaces

In this section we briefly describe Reproducing Kernel Hilbert Spaces. Since
we are mainly interested in the complex case, we recall the basic facts on
RKHSs associated with complex kernels. The material presented here may be
found in more detail in [19] and [20]. Given a function
$\kappa : X \times X \to \mathbb{C}$ and $x_1, \dots, x_N \in X$, the matrix¹
$K = (K_{i,j})^N$ with elements $K_{i,j} = \kappa(x_i, x_j)$, for
$i, j = 1, \dots, N$, is called the Gram matrix (or kernel matrix) of $\kappa$
with respect to $x_1, \dots, x_N$. A complex Hermitian matrix
$K = (K_{i,j})^N$ satisfying

$$c^H \cdot K \cdot c = \sum_{i=1,j=1}^{N,N} c_i^* c_j K_{i,j} \geq 0,$$

for all $c_i \in \mathbb{C}$, $i = 1, \dots, N$, is called Positive Definite².
Let $X$ be a nonempty set. Then a function
$\kappa : X \times X \to \mathbb{C}$, which for all $N \in \mathbb{N}$ and all
$x_1, \dots, x_N \in X$ gives rise to a positive definite Gram matrix $K$, is
called a Positive Definite Kernel. In the following we will frequently refer
to a positive definite kernel simply as a kernel.

Next, consider a linear class $\mathcal{H}$ of complex valued functions $f$
defined on a set $X$. Suppose further that in $\mathcal{H}$ we can define an
inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ with corresponding
norm $\| \cdot \|_{\mathcal{H}}$ and that $\mathcal{H}$ is complete with
respect to that norm, i.e., $\mathcal{H}$ is a Hilbert space. We call
$\mathcal{H}$ a Reproducing Kernel Hilbert Space (RKHS) if for all $x \in X$
the evaluation functional
$T_x : \mathcal{H} \to \mathbb{C} : T_x(f) = f(x)$ is a continuous (or,
equivalently, bounded) operator. If this is true, then by Riesz's
representation theorem, for all $x \in X$ there is a function
$g_x \in \mathcal{H}$ such that
$T_x(f) = f(x) = \langle f, g_x \rangle_{\mathcal{H}}$. The function
$\kappa : X \times X \to \mathbb{C} : \kappa(y, x) = g_x(y)$ is called a
reproducing kernel of $\mathcal{H}$. It can be easily proved that the function
$\kappa$ is a positive definite kernel.

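Alternatively, we can define an RKHS as a Hilbert space $\mathcal{H}$ for
which there exists a function $\kappa : X \times X \to \mathbb{C}$ with the
following two properties:

1. For every $x \in X$, $\kappa(\cdot, x)$ belongs to $\mathcal{H}$.

2. $\kappa$ has the so called reproducing property, i.e.,

$$f(x) = \langle f, \kappa(\cdot, x) \rangle_{\mathcal{H}}, \quad \text{for all } f \in \mathcal{H}, \tag{1}$$

in particular
$\kappa(x, y) = \langle \kappa(\cdot, y), \kappa(\cdot, x) \rangle_{\mathcal{H}}$.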

It has been proved (see [21]) that to every positive definite kernel $\kappa$
there corresponds one and only one class of functions $\mathcal{H}$ with a
uniquely determined inner product in it, forming a Hilbert space and admitting
$\kappa$ as a reproducing kernel. In fact, the kernel $\kappa$ produces the
entire space $\mathcal{H}$, i.e.,
$\mathcal{H} = \overline{\operatorname{span}}\{\kappa(x, \cdot) \mid x \in X\}$.
The map $\Phi : X \to \mathcal{H} : \Phi(x) = \kappa(\cdot, x)$ is called the
feature map of $\mathcal{H}$. Recall that in the case of complex Hilbert
spaces the inner product is sesquilinear and Hermitian. In the real case the
condition
$\kappa(x, y) = \langle \kappa(\cdot, y), \kappa(\cdot, x) \rangle_{\mathcal{H}}$
may be replaced by the well known equation
$\kappa(x, y) = \langle \kappa(\cdot, x), \kappa(\cdot, y) \rangle_{\mathcal{H}}$.
However, since in the complex case the inner product is Hermitian, the
aforementioned condition is equivalent to
$\kappa(x, y) = \left( \langle \kappa(\cdot, x), \kappa(\cdot, y) \rangle_{\mathcal{H}} \right)^*$.

¹ The term $(K_{i,j})^N$ denotes a square $N \times N$ matrix.

² In the matrix analysis literature, this is the definition of a positive
semidefinite matrix.

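Consider the complex valued function

$$\kappa_{\sigma,\mathbb{C}^d}(z, w) := \exp\left( -\frac{\sum_{i=1}^{d} (z_i - w_i^*)^2}{\sigma^2} \right), \tag{2}$$

defined on $\mathbb{C}^d \times \mathbb{C}^d$, where
$z, w \in \mathbb{C}^d$, $z_i$ denotes the $i$-th component of the complex
vector $z \in \mathbb{C}^d$ and $\exp$ is the extended exponential function in
the complex domain. It can be shown that $\kappa_{\sigma,\mathbb{C}^d}$ is a
$\mathbb{C}$-valued kernel on $\mathbb{C}^d$, which we call the complex
Gaussian kernel with parameter $\sigma$. Its restriction
$\kappa_{\sigma} := \left( \kappa_{\sigma,\mathbb{C}^d} \right)|_{\mathbb{R}^d \times \mathbb{R}^d}$
is the well known real Gaussian kernel:

$$\kappa_{\sigma}(x, y) = \exp\left( -\frac{\sum_{i=1}^{d} (x_i - y_i)^2}{\sigma^2} \right). \tag{3}$$

An explicit description of the RKHSs of these kernels, together with some
important properties, can be found in [22].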
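
The definitions above can be checked numerically. The following is a minimal
Python sketch, not part of the original paper: the function names, the sample
points and the coefficient vector are illustrative assumptions. It evaluates
the complex Gaussian kernel of Eq. (2), compares its restriction to real
arguments with Eq. (3), and evaluates the quadratic form
$\sum_{i,j} c_i^* c_j K_{i,j}$ on a small Gram matrix.

```python
import cmath
import math


def complex_gaussian_kernel(z, w, sigma):
    """Complex Gaussian kernel of Eq. (2) for two equal-length complex vectors."""
    s = sum((zi - complex(wi).conjugate()) ** 2 for zi, wi in zip(z, w))
    return cmath.exp(-s / sigma ** 2)


def real_gaussian_kernel(x, y, sigma):
    """Real Gaussian kernel of Eq. (3) for two equal-length real vectors."""
    s = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-s / sigma ** 2)


def gram_matrix(points, sigma):
    """Gram matrix with entries K[i][j] = kappa(points[i], points[j])."""
    return [[complex_gaussian_kernel(p, q, sigma) for q in points] for p in points]


def quadratic_form(K, c):
    """Evaluate sum_{i,j} conj(c_i) * c_j * K[i][j]."""
    n = len(c)
    return sum(complex(c[i]).conjugate() * c[j] * K[i][j]
               for i in range(n) for j in range(n))


# Hypothetical sample points in C^2 and an arbitrary coefficient vector.
points = [[0.3 + 0.1j, -0.2j], [1.0 - 0.5j, 0.4 + 0.4j], [-0.7 + 0.2j, 0.1 + 0j]]
K = gram_matrix(points, sigma=2.0)
q = quadratic_form(K, [1 - 2j, 0.5 + 0.3j, -1.5j])
print("quadratic form:", q)  # expected to be real and nonnegative up to rounding
```

For real arguments the conjugation has no effect, so the two kernel functions
agree; for complex arguments the Gram matrix is Hermitian but not real.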

3 Wirtinger's Calculus in complex RKHS 

Wirtinger's calculus [13] has become very popular in the signal processing
community, mainly in the context of complex adaptive filtering [T3], [33],
[TS], [TB], [24], as a means of computing, in an elegant way, gradients of
real valued cost functions defined on complex domains ($\mathbb{C}^{\nu}$).
The Cauchy-Riemann conditions dictate that such functions are not holomorphic
and therefore the complex derivative cannot be used. Instead, if we consider
that the cost function is defined on a Euclidean domain with double
dimensionality ($\mathbb{R}^{2\nu}$), then the real derivatives may be
employed. The price of this approach is that the computations become
cumbersome and tedious. Wirtinger's calculus provides an alternative
equivalent formulation that is based on simple rules and principles and which
bears a great resemblance to the rules of the standard complex derivative. A
self-consistent presentation of the main ideas of Wirtinger's calculus may be
found in the excellent and highly recommended introductory report of
K. Kreutz-Delgado [25].

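In the case of a simple non-holomorphic complex function $T$ defined on
$U \subseteq \mathbb{C}$, Wirtinger's calculus considers two forms of
derivatives, the $\mathbb{R}$-derivative and the conjugate
$\mathbb{R}$-derivative, which are defined as follows:

$$\frac{\partial T}{\partial z} = \frac{1}{2} \left( \frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} \right) + \frac{i}{2} \left( \frac{\partial v}{\partial x} - \frac{\partial u}{\partial y} \right),$$

$$\frac{\partial T}{\partial z^*} = \frac{1}{2} \left( \frac{\partial u}{\partial x} - \frac{\partial v}{\partial y} \right) + \frac{i}{2} \left( \frac{\partial v}{\partial x} + \frac{\partial u}{\partial y} \right),$$

where $T(z) = T(x + iy) = T(x, y) = u(x, y) + i v(x, y)$. Note that any such
non-holomorphic function can be written in the form $T(z, z^*)$. Having this
in mind, $\frac{\partial T}{\partial z}$ can be easily evaluated as the
standard complex partial derivative taken with respect to $z$ (thus treating
$z^*$ as a constant). Similarly, $\frac{\partial T}{\partial z^*}$ is
evaluated as the standard complex partial derivative taken with respect to
$z^*$ (thus treating $z$ as a constant). For example, if
$T(z, z^*) = z (z^*)^2$, then $\frac{\partial T}{\partial z} = (z^*)^2$ and
$\frac{\partial T}{\partial z^*} = 2 z z^*$. Similar principles and rules hold
for a function of many complex variables.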
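
As an illustration, the two derivatives can be estimated numerically from
their definitions in terms of the partial derivatives along $x$ and $y$. The
sketch below is only a sanity check under stated assumptions (central finite
differences with a hypothetical step `h`; the helper name
`wirtinger_derivatives` is not from the paper). It is applied to the example
$T(z, z^*) = z (z^*)^2$ at the arbitrarily chosen point $z_0 = 1 + 2i$.

```python
def wirtinger_derivatives(T, z, h=1e-6):
    """Return finite-difference estimates of (dT/dz, dT/dz*) at the point z."""
    dT_dx = (T(z + h) - T(z - h)) / (2 * h)            # partial derivative along x
    dT_dy = (T(z + 1j * h) - T(z - 1j * h)) / (2 * h)  # partial derivative along y
    d_dz = 0.5 * (dT_dx - 1j * dT_dy)    # R-derivative
    d_dzc = 0.5 * (dT_dx + 1j * dT_dy)   # conjugate R-derivative
    return d_dz, d_dzc


def T(z):
    """Example from the text: T(z, z*) = z (z*)^2."""
    return z * z.conjugate() ** 2


z0 = 1 + 2j
d_dz, d_dzc = wirtinger_derivatives(T, z0)
print(d_dz, z0.conjugate() ** 2)        # numerical estimate vs. (z*)^2
print(d_dzc, 2 * z0 * z0.conjugate())   # numerical estimate vs. 2 z z*
```

Writing $\partial T / \partial x = u_x + i v_x$ and
$\partial T / \partial y = u_y + i v_y$ shows that the two combinations used
in the code coincide with the definitions given above.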

Wirtinger's calculus has been developed only for operators defined on finite
dimensional spaces, $\mathbb{C}^{\nu}$. Hence, this calculus cannot be used in
RKH spaces, where the dimensionality of the function space can be infinite,
as is the case, for example, for the Gaussian RKHSs. To this end, Wirtinger's
calculus needs to be generalized to a general Hilbert space. A rigorous
presentation of this extension is out of the scope of the paper (due to lack
of space). Nevertheless, we will present the main ideas and results. We
employ the Fréchet derivative, a notion that generalizes differentiability to
abstract Banach or Hilbert spaces. Consider a Hilbert space $H$ over the field
$F$ (typically $\mathbb{R}$ or $\mathbb{C}$). The operator $T : H \to F$ is
said to be Fréchet differentiable at $f_0$ if there exists a $u \in H$ such
that

$$\lim_{\|h\|_H \to 0} \frac{T(f_0 + h) - T(f_0) - \langle u, h \rangle_H}{\|h\|_H} = 0, \tag{4}$$

where $\langle \cdot, \cdot \rangle_H$ is the dot product of the Hilbert space
$H$ and $\| \cdot \|_H = \sqrt{\langle \cdot, \cdot \rangle_H}$ is the induced
norm. The element $u$ is usually called the gradient of $T$ at $f_0$.

Assume that $T = (T_1, T_2)^T$,
$T(f) = T(f_1 + i f_2) = T(f_1, f_2) = T_1(f_1, f_2) + i T_2(f_1, f_2)$, is
differentiable as an operator defined on the RKHS $\mathcal{H}$ and let
$\nabla_1 T_1$, $\nabla_2 T_1$, $\nabla_1 T_2$ and $\nabla_2 T_2$ be the
partial derivatives with respect to the first ($f_1$) and the second ($f_2$)
variable respectively. It turns out (proofs are omitted due to lack of space)
that if $T(f_1, f_2)$ has derivatives of any order, then it can be written in
the form $T(f, f^*)$, where $f^* = f_1 - i f_2$, so that for fixed $f^*$, $T$
is $f$-holomorphic and for fixed $f$, $T$ is $f^*$-holomorphic. We may define
the $\mathbb{R}$-derivative and the conjugate $\mathbb{R}$-derivative of $T$
as follows:

$$\nabla_f T = \frac{1}{2} \left( \nabla_1 T_1 + \nabla_2 T_2 \right) + \frac{i}{2} \left( \nabla_1 T_2 - \nabla_2 T_1 \right), \tag{5}$$

$$\nabla_{f^*} T = \frac{1}{2} \left( \nabla_1 T_1 - \nabla_2 T_2 \right) + \frac{i}{2} \left( \nabla_1 T_2 + \nabla_2 T_1 \right). \tag{6}$$

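The following properties can be proved (among others):

1. The first order Taylor expansion around $f \in \mathcal{H}$ is given by

$$T(f + h) = T(f) + \langle h, (\nabla_f T(f))^* \rangle_{\mathcal{H}} + \langle h^*, (\nabla_{f^*} T(f))^* \rangle_{\mathcal{H}}.$$

2. If $T(f) = \langle f, w \rangle_{\mathcal{H}}$, then $\nabla_f T = w^*$, $\nabla_{f^*} T = 0$.

3. If $T(f) = \langle w, f \rangle_{\mathcal{H}}$, then $\nabla_f T = 0$, $\nabla_{f^*} T = w$.

4. If $T(f) = \langle f^*, w \rangle_{\mathcal{H}}$, then $\nabla_f T = 0$, $\nabla_{f^*} T = w^*$.

5. If $T(f) = \langle w, f^* \rangle_{\mathcal{H}}$, then $\nabla_f T = w$, $\nabla_{f^*} T = 0$.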

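An important consequence of the above properties is that if $T$ is a real
valued operator defined on $\mathcal{H}$, then its first order Taylor
expansion is given by:

$$T(f + h) = T(f) + \langle h, (\nabla_f T(f))^* \rangle_{\mathcal{H}} + \langle h^*, (\nabla_{f^*} T(f))^* \rangle_{\mathcal{H}}$$
$$= T(f) + \langle h, \nabla_{f^*} T(f) \rangle_{\mathcal{H}} + \left( \langle h, \nabla_{f^*} T(f) \rangle_{\mathcal{H}} \right)^*$$
$$= T(f) + 2 \cdot \Re\left[ \langle h, \nabla_{f^*} T(f) \rangle_{\mathcal{H}} \right].$$

However, in view of the Cauchy-Schwarz inequality we have:

$$\Re\left[ \langle h, \nabla_{f^*} T(f) \rangle_{\mathcal{H}} \right] \leq \left| \langle h, \nabla_{f^*} T(f) \rangle_{\mathcal{H}} \right| \leq \|h\|_{\mathcal{H}} \cdot \|\nabla_{f^*} T(f)\|_{\mathcal{H}}.$$

The equality in the above relationship holds if $h \propto \nabla_{f^*} T$.
Hence, the direction of increase of $T$ is $\nabla_{f^*} T(f)$. Therefore, any
gradient descent based algorithm minimizing $T(f)$ is based on the update
scheme:

$$f_n = f_{n-1} - \mu \cdot \nabla_{f^*} T(f_{n-1}). \tag{7}$$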
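
To make the update scheme (7) concrete, here is a minimal sketch on a scalar
toy problem that is not taken from the paper: the real valued cost
$T(f) = |f - c|^2 = (f - c)(f - c)^*$ for a fixed complex target $c$. Treating
$f$ as a constant and differentiating with respect to $f^*$ gives
$\nabla_{f^*} T = f - c$. The target value, the step size and the number of
iterations are arbitrary illustrative choices.

```python
def conjugate_gradient(f, c):
    """Conjugate Wirtinger gradient of T(f) = |f - c|^2, i.e. f - c."""
    return f - c


def gradient_descent(c, mu, steps, f0=0j):
    """Iterate f_n = f_{n-1} - mu * grad_{f*} T(f_{n-1}), as in Eq. (7)."""
    f = f0
    costs = []
    for _ in range(steps):
        costs.append(abs(f - c) ** 2)          # cost before the update
        f = f - mu * conjugate_gradient(f, c)  # step against the conjugate gradient
    return f, costs


target = 2 - 3j
f_final, costs = gradient_descent(target, mu=0.1, steps=200)
print(f_final)              # close to the target 2 - 3j
print(costs[0], costs[-1])  # the cost shrinks as the iteration proceeds
```

In this toy case each step multiplies the distance to the target by
$1 - \mu$, so the iteration converges for any $0 < \mu < 2$.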

4 Complex Kernel LMS 

As an application of the complex Gaussian kernel in adaptive filtering of
complex signals, we focus on the recently developed Kernel Least Mean Squares
algorithm (KLMS), which is the LMS algorithm in RKHSs [3], [12]. KLMS, like
all the known kernel methods that use real-valued kernels, was developed for
real valued data sequences only. Here, the KLMS is extended to include the
complex case. To our knowledge, no kernel-based strategy has been developed so
far that is able to deal effectively with complex valued signals. Wirtinger's
calculus is exploited to derive the necessary gradient updates.

Consider the sequence of examples
$(z(1), d(1)), (z(2), d(2)), \dots, (z(N), d(N))$, where
$d(n) \in \mathbb{C}$, $z(n) \in V \subset \mathbb{C}^{\nu}$,
$z(n) = x(n) + i y(n)$, $x(n), y(n) \in \mathbb{R}^{\nu}$, for
$n = 1, \dots, N$. We map the points $z(n)$ to the Gaussian complex RKHS
$\mathcal{H}$ using the feature map $\Phi$, for $n = 1, \dots, N$. The
objective of the complex Kernel LMS is to minimize
$E[\mathcal{L}_n(w)]$, where

$$\mathcal{L}_n(w) = |e(n)|^2 = \left| d(n) - \langle \Phi(z(n)), w \rangle_{\mathcal{H}} \right|^2$$
$$= \left( d(n) - \langle \Phi(z(n)), w \rangle_{\mathcal{H}} \right) \left( d(n) - \langle \Phi(z(n)), w \rangle_{\mathcal{H}} \right)^*$$
$$= \left( d(n) - \langle w^*, \Phi(z(n)) \rangle_{\mathcal{H}} \right) \left( d(n)^* - \langle w, \Phi(z(n)) \rangle_{\mathcal{H}} \right),$$

at each instance $n$. We then apply the complex LMS to the transformed data,
using the rules of Wirtinger's calculus to compute the gradient
$\nabla_{w^*} \mathcal{L}_n(w) = -e(n)^* \cdot \Phi(z(n))$. Therefore the
CKLMS update rule becomes
$w(n) = w(n-1) + \mu e(n)^* \cdot \Phi(z(n))$, where $w(n)$ denotes the
estimate at iteration $n$.

Assuming that $w(0) = 0$, the repeated application of the weight-update
equation gives:

$$w(n) = \mu \sum_{k=1}^{n} e(k)^* \Phi(z(k)). \tag{8}$$

Thus, the filter output at iteration $n$ becomes:

$$\hat{d}(n) = \langle \Phi(z(n)), w(n-1) \rangle_{\mathcal{H}} = \mu \sum_{k=1}^{n-1} e(k) \langle \Phi(z(n)), \Phi(z(k)) \rangle_{\mathcal{H}}$$
$$= \mu \sum_{k=1}^{n-1} e(k) \kappa_{\sigma,\mathbb{C}^{\nu}}(z(n), z(k)).$$

It can readily be shown that, since the CKLMS is the complex LMS in RKHS, the
important properties of the LMS (convergence in the mean, misadjustment, etc.)
carry over to CKLMS. Furthermore, note that using the complex Gaussian kernel
the algorithm is automatically normalized. The CKLMS algorithm is summarized
in Algorithm 1. Although it is developed in the context of the complex
Gaussian kernel, it may be used with any other complex reproducing kernel.



Algorithm 1 Normalized Complex Kernel LMS
INPUT: $(z(1), d(1)), \dots, (z(N), d(N))$
OUTPUT: The expansion of $w$, given by the list of coefficients $a$ and the
list of centers $Z$.
Initialization: Set $a = \{\}$, $Z = \{\}$ (i.e., $w = 0$). Select the step
parameter $\mu$ and the parameter $\sigma$ of the complex Gaussian kernel.
for n = 1:N do
    Compute the filter output:
        $\hat{d}(n) = \sum_{k=1}^{n-1} a(k) \cdot \kappa_{\sigma,\mathbb{C}^{\nu}}(z(n), z(k))$.
    Compute the error: $e(n) = d(n) - \hat{d}(n)$.
    $a(n) = \mu e(n)$.
    Add the new center $z(n)$ to the list of centers, i.e., add $z(n)$ to the
    list $Z$ and add $a(n)$ to the list $a$.
end for
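
The steps of Algorithm 1 can be sketched in a few lines of Python. This is a
minimal illustration under stated assumptions rather than the authors'
implementation: the function names `cklms` and `predict`, the list based
storage and the toy training data are hypothetical, no sparsification is
applied, and the kernel is evaluated as $\kappa(z(n), z(k))$ exactly as
written in the algorithm.

```python
import cmath


def complex_gaussian_kernel(z, w, sigma):
    """Complex Gaussian kernel of Eq. (2) for two equal-length complex vectors."""
    s = sum((zi - complex(wi).conjugate()) ** 2 for zi, wi in zip(z, w))
    return cmath.exp(-s / sigma ** 2)


def predict(z, coeffs, centers, sigma):
    """Filter output: sum_k a(k) * kappa(z, z(k)) over the stored centers."""
    return sum(a * complex_gaussian_kernel(z, c, sigma)
               for a, c in zip(coeffs, centers))


def cklms(samples, mu, sigma):
    """Run Algorithm 1 on a list of (z, d) pairs.

    Returns the coefficient list a, the list of centers Z and the errors e(n).
    """
    coeffs, centers, errors = [], [], []
    for z, d in samples:
        d_hat = predict(z, coeffs, centers, sigma)  # filter output
        e = d - d_hat                               # error e(n)
        coeffs.append(mu * e)                       # a(n) = mu * e(n)
        centers.append(z)                           # add z(n) to Z
        errors.append(e)
    return coeffs, centers, errors


# Toy usage: the same real-valued input presented repeatedly (hypothetical data).
toy = [([0.5 + 0j], 1 + 1j)] * 4
a, Z, e = cklms(toy, mu=0.5, sigma=5.0)
print([abs(err) for err in e])  # error magnitude halves at every step here
```

In the toy run the input is real, so $\kappa(z, z) = 1$ and the error obeys
$e(n) = (1 - \mu)^{n-1} d$, which gives a simple check of the recursion.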



In CKLMS, we start from an empty set (usually called the dictionary) and
gradually add new samples to that set, to form a summation similar to the one
shown in equation (8). This results in increasing memory and computational
requirements as time evolves. To cope with this problem and to produce sparse
solutions, we employ the well known novelty criterion [26], [12]. In novelty
criterion online sparsification, whenever a new data pair
$(\Phi(z_n), d_n)$ is considered, a decision is immediately made of whether to
add the new center $\Phi(z_n)$ to the dictionary of centers $\mathcal{C}$. The
decision is reached following two simple rules. First, the distance of the new
center $\Phi(z_n)$ from the current dictionary is evaluated:
$dis = \min_{c_k \in \mathcal{C}} \{ \| \Phi(z_n) - c_k \|_{\mathcal{H}} \}$.
If this distance is smaller than a given threshold $\delta_1$ (i.e., the new
center is close to the existing dictionary), then the center is not added to
$\mathcal{C}$. Otherwise, we compute the prediction error
$e_n = d_n - \hat{d}_n$. If $|e_n|$ is smaller than a predefined threshold
$\delta_2$, then the new center is discarded. Only if $|e_n| > \delta_2$ is
the new center $\Phi(z_n)$ added to the dictionary.
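
The two rules can be sketched as follows. This is a hedged illustration, not
the paper's code: the names `feature_distance` and `novelty_accepts` are
hypothetical, and the dictionary is assumed to store the input points $z_k$
whose images $c_k = \Phi(z_k)$ form $\mathcal{C}$. The distance in
$\mathcal{H}$ is evaluated with the kernel trick,
$\| \Phi(z) - \Phi(c) \|^2 = \kappa(z, z) + \kappa(c, c) - 2 \Re[\kappa(z, c)]$,
which follows from the reproducing property and the Hermitian symmetry of the
kernel (a step added here, not stated in the text).

```python
import cmath
import math


def complex_gaussian_kernel(z, w, sigma):
    """Complex Gaussian kernel of Eq. (2) for two equal-length complex vectors."""
    s = sum((zi - complex(wi).conjugate()) ** 2 for zi, wi in zip(z, w))
    return cmath.exp(-s / sigma ** 2)


def feature_distance(z, c, sigma):
    """Distance ||Phi(z) - Phi(c)|| in the RKHS, computed via kernel evaluations."""
    sq = (complex_gaussian_kernel(z, z, sigma)
          + complex_gaussian_kernel(c, c, sigma)
          - 2 * complex_gaussian_kernel(z, c, sigma)).real
    return math.sqrt(max(sq, 0.0))  # guard against tiny negative rounding errors


def novelty_accepts(z, error, centers, sigma, delta1, delta2):
    """Return True if the new center should be added to the dictionary."""
    if centers:
        dist = min(feature_distance(z, c, sigma) for c in centers)
        if dist < delta1:        # rule 1: too close to the dictionary
            return False
    return abs(error) > delta2   # rule 2: keep only if the error is large


dictionary = [[0j]]
print(novelty_accepts([3 + 0j], 0.5, dictionary, 1.0, 0.1, 0.2))  # far, large error
print(novelty_accepts([0j], 1.0, dictionary, 1.0, 0.1, 0.2))      # duplicate point
```

With an empty dictionary the distance test is skipped in this sketch, so the
decision rests on the error threshold alone.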
Fig. 1. The equalization problem.



5 Experiments 



We tested the CKLMS using a simple nonlinear channel equalization problem (see
Figure 1). The nonlinear channel consists of a linear filter:
$t(n) = (-0.9 + 0.8i) \cdot s(n) + (0.6 - 0.7i) \cdot s(n-1)$ and a memoryless
nonlinearity
$q(n) = t(n) + (0.1 + 0.15i) \cdot t^2(n) + (0.06 + 0.05i) \cdot t^3(n)$. At
the receiver end of the channel, the signal is corrupted by white Gaussian
noise and then observed as $r(n)$. The input signal that was fed to the
channel had the form

$$s(n) = 0.70 \left( \sqrt{1 - \rho^2}\, X(n) + i \rho Y(n) \right), \tag{9}$$

where $X(n)$ and $Y(n)$ are Gaussian random variables. This input is circular
for $\rho = \sqrt{2}/2$ and highly non-circular if $\rho$ approaches 0 or 1
[T3]. The aim of channel equalization is to construct an inverse filter which,
taking the output $r(n)$, reproduces the original input signal with as low an
error rate as possible. To this end we apply the NCKLMS algorithm to the set
of samples

$$\left( (r(n + D), r(n + D - 1), \dots, r(n + D - L))^T, s(n) \right),$$

where $L > 0$ is the filter length and $D$ the equalization time delay.
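
The data generation described above can be sketched as follows. Several
details are assumptions made only for this illustration: $X(n)$ and $Y(n)$ are
drawn with zero mean and unit variance, the channel starts from $s(0) = 0$,
the noise level `noise_std` is a free parameter (no noise power is stated in
this section), and all function names are hypothetical. The tap vector follows
the indices exactly as written above, so it holds $L + 1$ entries.

```python
import math
import random


def generate_input(n_samples, rho, rng):
    """Input signal of Eq. (9): s(n) = 0.70 (sqrt(1 - rho^2) X(n) + i rho Y(n))."""
    scale = math.sqrt(1.0 - rho ** 2)
    return [0.70 * complex(scale * rng.gauss(0.0, 1.0), rho * rng.gauss(0.0, 1.0))
            for _ in range(n_samples)]


def nonlinear_channel(s, noise_std, rng):
    """Linear filter, memoryless nonlinearity and additive white Gaussian noise."""
    r = []
    prev = 0j  # assumption: the sample before the first one is zero
    for sn in s:
        t = (-0.9 + 0.8j) * sn + (0.6 - 0.7j) * prev
        q = t + (0.1 + 0.15j) * t ** 2 + (0.06 + 0.05j) * t ** 3
        noise = complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
        r.append(q + noise)
        prev = sn
    return r


def equalizer_samples(r, s, L, D):
    """Pairs ((r(n+D), r(n+D-1), ..., r(n+D-L)), s(n)) for all valid n."""
    samples = []
    for n in range(max(L - D, 0), len(r) - D):
        taps = [r[n + D - k] for k in range(L + 1)]
        samples.append((taps, s[n]))
    return samples


rng = random.Random(0)
s = generate_input(5000, math.sqrt(2) / 2, rng)  # circular input case
r = nonlinear_channel(s, noise_std=0.1, rng=rng)
training = equalizer_samples(r, s, L=5, D=2)
print(len(training))
```

The resulting list of pairs has the same shape as the input expected by a
kernel LMS routine operating on complex vectors.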

Experiments were conducted on a set of 5000 samples of the input signal (9),
considering both the circular and the non-circular case. The results are
compared with the NCLMS and the WL-NCLMS algorithms. In all algorithms the
step update parameter $\mu$ is tuned for the best possible results. The time
delay $D$ was also set for optimality. Figure 2 shows the learning curves of
the NCKLMS with $\sigma = 5$, compared with the NCLMS and the WL-NCLMS
algorithms. The novelty criterion was applied to the CKLMS for sparsification
with $\delta_1 = 0.1$ and $\delta_2 = 0.2$. In both examples, CKLMS
considerably outperforms both the NCLMS and the WL-NCLMS algorithms. However,
this enhanced behavior comes at a price in computational complexity, since the
CKLMS requires the evaluation of the kernel function on a growing number of
training examples.




Fig. 2. Learning curves for CKLMS ($\mu = 1$), CLMS ($\mu = 1/16$) and WL-CLMS
($\mu = 1/16$) (filter length $L = 5$, delay $D = 2$) in the nonlinear channel
equalization, for (a) the circular input case and (b) the non-circular input
case.